Abstract

We show that recent results in [3] on risk bounds for regularized least-squares on
reproducing kernel Hilbert spaces can be straightforwardly extended to the
vector-valued regression setting. We first briefly introduce central concepts on
operator-valued kernels, then we show how risk bounds can be expressed in terms of a
generalization of effective dimension.
1 Introduction


This work presents an extension to multi-task learning of our recent results [3] on risk
estimates for regularized least-squares (RLS) with reproducing kernel Hilbert spaces
(RKHS). Recently various papers, [18], [1], [10], [8], have addressed the problem of
multi-task learning using kernel techniques. For instance, [18] employs two kernels,
one on the inputs and one on the outputs, in order to represent similarity measures
on the corresponding spaces. The underlying similarity measures are supposed to
capture some inherent regularity of the phenomenon under investigation and should
be chosen according to the available prior knowledge. By contrast, in [1] the prior
knowledge is encoded by a single kernel on the space of input-output couples, and a
generalization of standard support vector machines is proposed. It was in [10], [8]
that, for the first time in the learning theory literature, it was pointed out that
particular scalar kernels defined on input-output couples can be profitably mapped
onto operator-valued kernels [2] defined on the input space.

It is well known [2] that the machinery of RKHS can be elegantly extended to cope
with vector-valued functions using operator-valued kernels, so one would expect
that kernel methods for single-task learning can be adapted to multi-task learning
in this extended RKHS framework. In fact we show that the risk bounds obtained
in [3] can be straightforwardly rephrased in this more general setting.

The paper is organized as follows. In Section 2 we recall very briefly the main
concepts of statistical learning theory with vector-valued outputs and define the
RLS algorithm in this framework. In Section 3 we fix the notation, extending the
familiar formalism of reproducing kernel Hilbert spaces to the operator-valued case.
We also introduce the assumptions on the hypothesis space and on the probability
measure from which the samples are drawn. Furthermore we prove some preliminary
results on the structure of RLS estimators and on concentration of measure for
vector-valued random variables. Finally, in Section 4 we prove the probabilistic upper
bound for the excess risk of RLS estimators using a generalized effective dimension.


2  Learning  from  examples 


We first briefly introduce some basic concepts of statistical learning theory in the
regression setting for vector-valued outputs (for details see [16], [9], [13], [4], [10]
and references therein).

In the framework of learning from examples there are two sets of variables: the input
space $X$ and the output space $Y$, which we will assume to be a separable Hilbert
space. The relation between the input $x \in X$ and the output $y \in Y$ is described by
a probability distribution $\rho(x,y) = \nu(x)\rho(y|x)$ on $X \times Y$, where $\nu$ is the
marginal distribution on $X$ and $\rho(\cdot|x)$ is the conditional distribution of $y$ given
$x \in X$. The distribution $\rho$ is known only through a sample
$\mathbf z = (\mathbf x, \mathbf y) = ((x_1,y_1),\dots,(x_\ell,y_\ell))$, called the training set,
drawn independently and identically distributed (i.i.d.) according to $\rho$. Given the
sample $\mathbf z$, the aim of learning theory is to find a function $f_{\mathbf z} : X \to Y$
such that $f_{\mathbf z}(x)$ is a good estimate of the output $y$ when a new input $x$ is given.
The function $f_{\mathbf z}$ is called the estimator, and the map providing $f_{\mathbf z}$ for any
training set $\mathbf z$ is called the learning algorithm.

Given a function $f : X \to Y$, the ability of $f$ to describe the distribution $\rho$ is
measured by its expected risk, defined as

$$ I[f] = \int_{X \times Y} \|f(x) - y\|_Y^2 \, d\rho(x,y), $$

and the regression function

$$ f_\rho(x) = \int_Y y \, d\rho(y|x) $$

is the minimizer of the expected risk over the space of all measurable $Y$-valued
functions on $X$. In this sense $f_\rho$ can be seen as the ideal estimator for the
probability distribution $\rho$. However, the regression function cannot be reconstructed
exactly since only a finite, possibly small, set of examples $\mathbf z$ is given.
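
The minimizing property of the regression function follows from a standard decomposition of the expected risk. The lines below are a sketch of that argument, under the assumptions (not stated at this point of the text) that $f \in L^2(X,\nu,Y)$ and $\int \|y\|_Y^2 \, d\rho < +\infty$, so that every term is finite.

```latex
\begin{align*}
I[f] &= \int_{X\times Y} \|f(x)-f_\rho(x)\|_Y^2 \, d\rho(x,y)
      + 2\int_{X\times Y} \langle f(x)-f_\rho(x),\, f_\rho(x)-y \rangle_Y \, d\rho(x,y)
      + I[f_\rho] \\
     &= \int_X \|f(x)-f_\rho(x)\|_Y^2 \, d\nu(x) + I[f_\rho].
\end{align*}
```

The cross term vanishes because, for fixed $x$, integrating $f_\rho(x) - y$ against $\rho(\cdot|x)$ gives zero by the definition of $f_\rho$. Hence $I[f] \ge I[f_\rho]$, with equality if and only if $f = f_\rho$ $\nu$-almost everywhere.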

To overcome this problem, in the framework of the regularized least squares
algorithm [17], [12], [4], [20], a Hilbert space $\mathcal H$ of functions $f : X \to Y$ is fixed
and the estimator $f_{\mathbf z}^\lambda$ is defined as the solution of the regularized least squares
problem,

$$ \min_{f \in \mathcal H} \Big\{ \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(x_i) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal H}^2 \Big\}, \tag{1} $$

where $\lambda$ is a positive parameter to be chosen in order to ensure that the discrepancy

$$ I[f_{\mathbf z}^\lambda] - \inf_{f \in \mathcal H} I[f] $$

is small with high probability. Since $\rho$ is unknown, the above difference is studied
by means of a probabilistic bound $B(\lambda,\ell,\eta)$, which is a function depending on the
regularization parameter $\lambda$, the number $\ell$ of examples and the confidence level
$1-\eta$, such that

$$ P\Big[ I[f_{\mathbf z}^\lambda] - \inf_{f \in \mathcal H} I[f] \le B(\lambda,\ell,\eta) \Big] \ge 1 - \eta. $$


In particular, the learning algorithm is consistent if it is possible to choose the
regularization parameter, as a function of the available data $\lambda = \lambda(\ell,\mathbf z)$, in
such a way that

$$ \lim_{\ell \to +\infty} P\Big[ I[f_{\mathbf z}^{\lambda(\ell,\mathbf z)}] - \inf_{f \in \mathcal H} I[f] \ge \epsilon \Big] = 0, \tag{2} $$

for every $\epsilon > 0$. The above convergence in probability is usually called (weak)
consistency of the algorithm (see [5] for a discussion of the different kinds of
consistency).
3 Notations and preliminary results


In  this  section  we  state  the  notations,  we  set  the  main  assumptions  and  we  prove 
some  preliminary  results. 

We  assume  that  the  input  space  X  is  a  Polish  space  and  the  output  space  Y  is  a 
real  separable  Hilbert  space.  We  let  Z  be  the  product  space  X  x  Y,  which  is  a  Polish 
space  too.  The  assumptions  on  X  and  Y  will  avoid  measurability  problems. 

The space of bounded linear operators on $Y$ with the uniform norm $\|\cdot\|_{\mathcal L(Y)}$ will
be denoted by $\mathcal L(Y)$, and $\mathcal L_2(Y)$ will be the separable Hilbert space of
Hilbert-Schmidt operators on $Y$ with scalar product

$$ \langle A, B \rangle_{\mathcal L_2(Y)} = \mathrm{Tr}(B^* A) $$

and norm

$$ \|A\|_{\mathcal L_2(Y)} = \sqrt{\mathrm{Tr}(A^* A)} \ge \|A\|_{\mathcal L(Y)}, $$

where $\mathrm{Tr}$ denotes the trace and $*$ the adjoint (we use similar notation when $Y$ is
replaced by other Hilbert spaces).

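We first discuss the assumptions on the space $\mathcal H$.

Hypothesis 1 The space $\mathcal H$ is a separable Hilbert space with reproducing kernel

$$ K : X \times X \to \mathcal L_2(Y) \subset \mathcal L(Y) $$

such that

$$ X \times X \ni (x,t) \mapsto \langle K(x,t)v, w \rangle_Y \ \text{is measurable} \quad \forall v, w \in Y \tag{3} $$

$$ \|K(x,x)\|_{\mathcal L_2(Y)} \le \kappa \quad \forall x \in X \tag{4} $$

for some $\kappa > 0$.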


We recall that $\mathcal H$ is a real Hilbert space of functions $f : X \to Y$ satisfying the
following reproducing property

$$ f(x) = K_x^* f \qquad f \in \mathcal H,\ x \in X, \tag{5} $$

where $K_x : Y \to \mathcal H$ is the bounded operator

$$ K_x v = K(\cdot, x) v \qquad v \in Y, \tag{6} $$

and (5) gives

$$ K_t^* K_x = K(t,x) \in \mathcal L_2(Y) \qquad \forall x, t \in X. \tag{7} $$

Moreover, given $x \in X$ the operator

$$ T_x = K_x K_x^* \in \mathcal L_2(\mathcal H) \tag{8} $$

is a positive Hilbert-Schmidt operator and (8), together with (4), ensures

$$ \|T_x\|_{\mathcal L(\mathcal H)} \le \|T_x\|_{\mathcal L_2(\mathcal H)} = \|K(x,x)\|_{\mathcal L_2(Y)} \le \kappa. \tag{9} $$

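If $Y = \mathbb R$, the space $\mathcal L_2(Y)$ reduces to $\mathbb R$, $K_x \in \mathcal H$, and
$K_x^* f = \langle f, K_x \rangle_{\mathcal H}$, so that $\mathcal H$ is the reproducing kernel Hilbert
space with kernel $K$ [2]. The theory can be extended to vector-valued functions [14].
In particular the space $\mathcal H$ is uniquely defined by its kernel in the sense that, given
a kernel $K : X \times X \to \mathcal L(Y)$ such that

$$ \langle K(x,t)v, w \rangle_Y = \langle K(t,x)w, v \rangle_Y \qquad \forall x, t \in X,\ v, w \in Y $$

$$ \sum_{i,j=1}^{n} \langle K(x_i,x_j)v_j, v_i \rangle_Y \ge 0 \qquad \forall n \in \mathbb N,\ x_i \in X,\ v_i \in Y $$

there is a unique Hilbert space of functions $f : X \to Y$ satisfying (5).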

The assumption that the kernel $K$ takes values in the Hilbert space
$\mathcal L_2(Y) \subset \mathcal L(Y)$ simplifies the theory and is enough for our purposes.
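
A simple family of kernels meeting these requirements, given here only as an illustration and not taken from the text, is that of separable kernels. In the sketch below $k$ is assumed to be a scalar symmetric positive definite kernel on $X$ with $\sup_x k(x,x) \le c$, and $B$ a positive Hilbert-Schmidt operator on $Y$; the names $k$, $c$ and $B$ are hypothetical.

```latex
\begin{align*}
K(x,t) &= k(x,t)\, B, \\
\langle K(x,t)v, w \rangle_Y &= k(x,t)\, \langle Bv, w \rangle_Y
   = k(t,x)\, \langle Bw, v \rangle_Y = \langle K(t,x)w, v \rangle_Y, \\
\sum_{i,j=1}^{n} \langle K(x_i,x_j)v_j, v_i \rangle_Y
   &= \sum_{i,j=1}^{n} k(x_i,x_j)\, \langle B^{1/2}v_j, B^{1/2}v_i \rangle_Y \ge 0, \\
\|K(x,x)\|_{\mathcal L_2(Y)} &= k(x,x)\, \|B\|_{\mathcal L_2(Y)} \le c\, \|B\|_{\mathcal L_2(Y)}.
\end{align*}
```

The positivity in the third line holds because the entrywise product of two positive semidefinite matrices is positive semidefinite (Schur product theorem). Under these assumptions the bound (4) holds with $\kappa = c\,\|B\|_{\mathcal L_2(Y)}$.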

The condition that $\mathcal H$ is separable, which is essential in the following, is not
ensured by the assumptions on the kernel $K$. However, if (3) is replaced by the
stronger condition

$$ X \times X \ni (x,t) \mapsto \langle K(x,t)v, w \rangle_Y \ \text{is continuous} \quad \forall v, w \in Y, $$

the fact that $X$ and $Y$ are separable implies that $\mathcal H$ is separable, too.

As shown by Proposition 2 below, (3) and (9) are the minimal requirements to ensure
that any $f \in \mathcal H$ has a finite expected risk for every probability measure satisfying
(10).

Let now $\rho$ be a probability measure on $Z$. By $\rho_X$ we will denote the marginal
distribution of $\rho$ on $X$ and by $\rho(\cdot|x)$ the conditional distribution on $Y$ given
$x \in X$, both existing since $Z$ is a Polish space (see Theorem 10.2.2 of [6]). Let
$L^2(Z,\rho,Y)$ be the Hilbert space of functions $\phi : Z \to Y$ that are square-integrable
with respect to $\rho$, and denote by $\|\cdot\|_\rho$ and $\langle \cdot, \cdot \rangle_\rho$ the
corresponding norm and scalar product. We use similar notation for $L^2(X,\rho_X,Y)$.

Since $\rho$ is a probability measure, $L^2(X,\rho_X,Y)$ can be regarded as a closed subspace
of $L^2(Z,\rho,Y)$ and the corresponding orthogonal projection $Q$ is

$$ (Q\phi)(x) = \int_Y \phi(x,y) \, d\rho(y|x) \qquad \phi \in L^2(Z,\rho,Y). $$

Finally, the expected risk with respect to $\rho$ of a measurable function $f : X \to Y$ is

$$ I[f] = \int_{X \times Y} \|f(x) - y\|_Y^2 \, d\rho(x,y). $$

We are now ready to state the hypothesis on $\rho$.

Hypothesis 2 The probability measure $\rho$ on $Z$ satisfies

$$ \int_{X \times Y} \|y\|_Y^2 \, d\rho(x,y) < +\infty \tag{10} $$

$$ \mathrm{Tr}\Big[ \int_X T_x \, d\rho_X(x) \Big] < +\infty, \tag{11} $$

and there are $f_{\mathcal H} \in \mathcal H$ and $M > 0$ such that

$$ I[f_{\mathcal H}] = \inf_{f \in \mathcal H} I[f] \tag{12} $$

$$ \|y - f_{\mathcal H}(x)\|_Y^2 \le M \quad \text{a.s.} \tag{13} $$

If (10) is not satisfied, then $I[f] = +\infty$ for all $f \in \mathcal H$ and the learning problem
does not make sense. If it holds, $I[f]$ is finite for all $f \in L^2(X,\rho_X,Y)$.

In general $f_{\mathcal H}$ is not unique as an element of $\mathcal H$, but two different solutions
are equal almost everywhere, see (17) below.

If the regression function

$$ f_\rho = \int_Y y \, d\rho(y|x) = Qy $$

belongs to $\mathcal H$, clearly $f_{\mathcal H} = f_\rho$. However, in general the existence of
$f_{\mathcal H}$ is a weaker condition than $f_\rho \in \mathcal H$; for example, if $\mathcal H$ is finite
dimensional, $f_{\mathcal H}$ always exists. Proposition 1 will prove that the integral in (11)
always converges to a positive Hilbert-Schmidt operator $T$, see (15) below, so (11)
states that $T$ is in fact trace class. Conditions (11), (12) and (13) are needed to prove
the upper bound (28).

We now study some mathematical properties of the expected risk and of the
regularized least squares algorithm.

Let $A : \mathcal H \to L^2(Z,\rho,Y)$ be the linear operator

$$ (Af)(x,y) = K_x^* f \qquad \forall (x,y) \in Z. $$

Equation (5) implies that the action of $A$ on an element $f$ is simply

$$ (Af)(x,y) = f(x), $$

that is, $A$ is the canonical inclusion of $\mathcal H$ into $L^2(Z,\rho,Y)$, where $y$ is a dummy
variable and functions are identified $\rho$-almost everywhere. So $A$ need not be
injective and $\mathcal H$ need not be closed in $L^2(Z,\rho,Y)$, since $\|f\|_{\mathcal H}$ is in general
different from $\|f\|_\rho$.

The main properties of the operator $A$ are summarized in the following proposition.

Proposition 1 If $\mathcal H$ satisfies Hypothesis 1 and $\rho$ is a probability measure, $A$ is a
bounded operator from $\mathcal H$ into $L^2(Z,\rho,Y)$, the adjoint $A^* : L^2(Z,\rho,Y) \to \mathcal H$ is

$$ A^* \phi = \int_Z K_x \phi(x,y) \, d\rho(x,y) = \int_X K_x (Q\phi)(x) \, d\rho_X(x), \tag{14} $$

where the integral converges in $\mathcal H$, and $A^* A$ is the Hilbert-Schmidt operator on $\mathcal H$

$$ T = \int_X T_x \, d\rho_X(x), \tag{15} $$

where the integral converges in $\mathcal L_2(\mathcal H)$, and

$$ \|T\|_{\mathcal L(\mathcal H)} \le \|T\|_{\mathcal L_2(\mathcal H)} \le \kappa. \tag{16} $$
PROOF. The proof is standard for $Y = \mathbb R$ and it can easily be extended to the vector
case.

First we prove that any function $f \in \mathcal H$ is measurable and bounded. Since $Y$ is
separable, it is enough to prove that the function

$$ x \mapsto \langle f(x), v \rangle_Y = \langle f, K_x v \rangle_{\mathcal H} $$

is measurable for all $v \in Y$. If $f = K_t w$ for some $t \in X$ and $w \in Y$, the claim follows
by (7) and (3). Since (5) ensures that the set

$$ \{ K_t w \mid t \in X,\ w \in Y \} $$

is total in $\mathcal H$, the measurability for arbitrary $f$ follows by density. Finally, (9) gives

$$ \|f(x)\|_Y = \|K_x^* f\|_Y \le \|f\|_{\mathcal H} \, \|K_x\|_{\mathcal L(Y,\mathcal H)} \le \sqrt{\kappa} \, \|f\|_{\mathcal H}. $$

Since $\rho$ is a probability measure, $f \in L^2(Z,\rho,Y)$ and $A$ is a linear operator
from $\mathcal H$ to $L^2(Z,\rho,Y)$ with $\|Af\|_\rho \le \sqrt{\kappa} \, \|f\|_{\mathcal H}$, so that $A$ is bounded.

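We now prove (14). Indeed, given $\phi \in L^2(Z,\rho,Y)$, (3) ensures that the function

$$ (x,y) \mapsto \langle K_x \phi(x,y), f \rangle_{\mathcal H} = \langle \phi(x,y), f(x) \rangle_Y $$

is measurable for all $f \in \mathcal H$. Since $\mathcal H$ is separable, the map

$$ Z \ni (x,y) \mapsto K_x \phi(x,y) \in \mathcal H $$

is measurable as a map taking values in $\mathcal H$ and (4) gives

$$ \|K_x \phi(x,y)\|_{\mathcal H} = \sqrt{ \langle K_x^* K_x \phi(x,y), \phi(x,y) \rangle_Y } \le \sqrt{\kappa} \, \|\phi(x,y)\|_Y $$

for all $(x,y) \in Z$. Since $\rho$ is finite, $\phi$ is in $L^1(Z,\rho,Y)$ and, hence,
$(x,y) \mapsto K_x \phi(x,y)$ is integrable as a vector-valued map. Finally, for all $f \in \mathcal H$,

$$ \int_Z \langle K_x \phi(x,y), f \rangle_{\mathcal H} \, d\rho(x,y) = \langle \phi, Af \rangle_\rho = \langle A^* \phi, f \rangle_{\mathcal H}, $$

so the first part of (14) holds and the second one is a consequence of the definition
of $Q$.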

Reasoning as above, it follows that the map

$$ X \ni x \mapsto T_x \in \mathcal L_2(\mathcal H) $$

is integrable as a function taking values in $\mathcal L_2(\mathcal H)$. In particular,
$T \in \mathcal L_2(\mathcal H)$ and (15) is a consequence of (14). The bound (16) follows from (9).


The role of the operator $A$ in the context of learning theory becomes clear by observing
that for all $f \in \mathcal H$

$$ I[f] = \|Af - y\|_\rho^2, $$

where $y$ denotes both the variable and the function $(x,y) \mapsto y$, which belongs to
$L^2(Z,\rho,Y)$ by (10). So the following result holds.

Proposition 2 If Hypothesis 1 and (10) hold, $f_{\mathcal H} \in \mathcal H$ is a minimizer of the
expected risk $I[\cdot]$ if and only if it satisfies

$$ T f_{\mathcal H} = A^* y \tag{17} $$

and

$$ I[f] - I[f_{\mathcal H}] = \|A(f - f_{\mathcal H})\|_\rho^2 = \|\sqrt{T}(f - f_{\mathcal H})\|_{\mathcal H}^2 \qquad \forall f \in \mathcal H. \tag{18} $$

Moreover, for $\lambda > 0$, a unique minimizer $f^\lambda$ of the regularized expected risk

$$ I[f] + \lambda \|f\|_{\mathcal H}^2 $$

exists and is given by

$$ f^\lambda = (T + \lambda)^{-1} A^* y = (T + \lambda)^{-1} T f_{\mathcal H}. \tag{19} $$

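PROOF. The result is well known in the framework of linear inverse problems [7]
and we report it for completeness. Since the expected risk is convex, $f_{\mathcal H}$ is a
minimizer if and only if the derivative of $I[f]$ is zero, that is,

$$ \langle Af, A f_{\mathcal H} - y \rangle_\rho = 0 \qquad \forall f \in \mathcal H \tag{20} $$

and (17) follows.

Given $f \in \mathcal H$

$$ \begin{aligned}
I[f] - I[f_{\mathcal H}] &= \|Af - y\|_\rho^2 - \|A f_{\mathcal H} - y\|_\rho^2 \\
 &= \|A(f - f_{\mathcal H})\|_\rho^2 + 2 \langle A(f - f_{\mathcal H}), A f_{\mathcal H} - y \rangle_\rho \\
 &= \|A(f - f_{\mathcal H})\|_\rho^2
\end{aligned} $$

since the second term is zero due to (20). Let $A = U \sqrt{T}$ be the polar decomposition.
Since $U$ is a partial isometry from the closure of the range of $\sqrt{T}$ onto the closure
of the range of $A$,

$$ \|A(f - f_{\mathcal H})\|_\rho = \|\sqrt{T}(f - f_{\mathcal H})\|_{\mathcal H}. $$

Finally, (19) follows by setting the derivative of the regularized expected risk equal to zero.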
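
The last step of the proof can be spelled out as follows. This is a sketch of the standard computation, where $h \in \mathcal H$ denotes an arbitrary direction (a name introduced here for illustration) and $T = A^* A$ as in Proposition 1.

```latex
\begin{align*}
\frac{d}{ds}\Big( \|A(f+sh) - y\|_\rho^2 + \lambda \|f+sh\|_{\mathcal H}^2 \Big)\Big|_{s=0}
  &= 2 \langle Ah, Af - y \rangle_\rho + 2\lambda \langle h, f \rangle_{\mathcal H} \\
  &= 2 \langle h, (T+\lambda) f - A^* y \rangle_{\mathcal H} = 0
     \quad \forall h \in \mathcal H \\
  \Longrightarrow\quad f^\lambda &= (T+\lambda)^{-1} A^* y
     = (T+\lambda)^{-1} T f_{\mathcal H}, \\
  f^\lambda - f_{\mathcal H} &= -\lambda (T+\lambda)^{-1} f_{\mathcal H}.
\end{align*}
```

The inverse exists because $T$ is positive, so $T + \lambda \ge \lambda > 0$; the second equality in the third line uses (17), and strict convexity of the regularized functional gives uniqueness.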

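Clearly, $A$, $T$, $f_{\mathcal H}$ and $f^\lambda$ depend on $\rho$ and, when needed, we write this
dependence explicitly.

In particular, given $\mathbf z = (\mathbf x, \mathbf y) = ((x_1,y_1),\dots,(x_\ell,y_\ell)) \in Z^\ell$,
consider the empirical measure

$$ \rho_{\mathbf z} = \frac{1}{\ell} \sum_{i=1}^{\ell} \delta_{(x_i,y_i)}, \qquad (\rho_{\mathbf z})_X = \frac{1}{\ell} \sum_{i=1}^{\ell} \delta_{x_i}, $$

where $\delta_{(x,y)}$ is the Dirac measure at the point $(x,y) \in Z$. Since $\rho_{\mathbf z}$ is finitely
supported, any element $w \in L^2(Z,\rho_{\mathbf z})$ is uniquely defined by $\ell$ vectors

$$ w_i = w(x_i,y_i) \in Y \qquad i = 1,\dots,\ell $$

with the condition that $w_i = w_j$ whenever $(x_i,y_i) = (x_j,y_j)$, and the scalar product
becomes

$$ \langle w, w' \rangle_{L^2(Z,\rho_{\mathbf z})} = \frac{1}{\ell} \sum_{i=1}^{\ell} \langle w_i, w'_i \rangle_Y. $$

In the following we let $A_{\mathbf z} = A_{\rho_{\mathbf z}}$, $T_{\mathbf x} = T_{\rho_{\mathbf z}}$ and
$f_{\mathbf z}^\lambda = f^\lambda_{\rho_{\mathbf z}}$.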

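Since $\rho_{\mathbf z}$ has finite support, (10) reduces to the condition $y \in L^2(Z,\rho_{\mathbf z})$,
which is clearly satisfied, so Propositions 1 and 2 become

$$ (A_{\mathbf z} f)_i = K_{x_i}^* f = f(x_i) \qquad \forall i = 1,\dots,\ell \tag{21} $$

$$ A_{\mathbf z}^* w = \frac{1}{\ell} \sum_{i=1}^{\ell} K_{x_i} w_i \qquad w \in L^2(Z,\rho_{\mathbf z}) \tag{22} $$

$$ T_{\mathbf x} := A_{\mathbf z}^* A_{\mathbf z} = \frac{1}{\ell} \sum_{i=1}^{\ell} T_{x_i} \tag{23} $$

$$ f_{\mathbf z}^\lambda = (T_{\mathbf x} + \lambda)^{-1} A_{\mathbf z}^* \mathbf y, \tag{24} $$

where $f_{\mathbf z}^\lambda$ is the unique minimizer of the regularized empirical error

$$ \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(x_i) - y_i\|_Y^2 + \lambda \|f\|_{\mathcal H}^2. $$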
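
The empirical estimator can be computed by solving a finite linear system. The following is a minimal sketch, not the authors' implementation, assuming $Y = \mathbb R^d$ and a separable kernel $K(x,t) = k(x,t)B$ with a Gaussian scalar kernel $k$ and a fixed symmetric positive semidefinite matrix $B$ (both are illustrative choices, as are all function names). Writing $f = \sum_j K_{x_j} c_j$, the estimator equation reduces to $(G + \lambda \ell I)c = y$, where $G$ is the block Gram matrix with blocks $K(x_i,x_j)$.

```python
import math

def gaussian(x, t, width=1.0):
    """Scalar Gaussian kernel k(x, t) on the real line (illustrative choice)."""
    return math.exp(-((x - t) ** 2) / (2.0 * width ** 2))

def solve(matrix, rhs):
    """Solve matrix @ sol = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def fit_rls(xs, ys, B, lam):
    """Coefficients c_j in R^d of f = sum_j K(., x_j) c_j, with K(x, t) = k(x, t) B."""
    ell, d = len(xs), len(B)
    size = ell * d
    # Block Gram matrix G: entry (i, j) belongs to block K(x_{i//d}, x_{j//d}).
    gram = [[gaussian(xs[i // d], xs[j // d]) * B[i % d][j % d]
             for j in range(size)] for i in range(size)]
    # Regularized system G + lam * ell * I.
    system = [[gram[i][j] + (lam * ell if i == j else 0.0)
               for j in range(size)] for i in range(size)]
    flat_y = [component for y in ys for component in y]
    flat_c = solve(system, flat_y)
    return [flat_c[i * d:(i + 1) * d] for i in range(ell)]

def predict(x, xs, coeffs, B):
    """Evaluate f(x) = sum_j k(x, x_j) B c_j."""
    d = len(B)
    out = [0.0] * d
    for xj, cj in zip(xs, coeffs):
        k = gaussian(x, xj)
        for a in range(d):
            out[a] += k * sum(B[a][b] * cj[b] for b in range(d))
    return out

# Toy data: two coupled outputs observed at four inputs.
xs = [0.0, 0.5, 1.0, 1.5]
ys = [[math.sin(x), math.cos(x)] for x in xs]
B = [[1.0, 0.3], [0.3, 1.0]]  # couples the two output components
coeffs = fit_rls(xs, ys, B, lam=1e-3)
fitted = [predict(x, xs, coeffs, B) for x in xs]
```

At the training points the solution satisfies $f(x_i) + \lambda \ell c_i = y_i$, which gives a quick consistency check of the solver. For general operator-valued kernels only the construction of the Gram blocks changes.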

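The following technical lemma will be used in the proof of Theorem 5.

Lemma 3 If $\rho$ satisfies (11), i.e. $T$ is trace class, then $T_x$ is trace class for
$\rho_X$-almost all $x \in X$.

PROOF. Let $(e_k)_{k \ge 1}$ be a basis of $\mathcal H$. For all $k \ge 1$ the functions

$$ x \mapsto \langle T_x e_k, e_k \rangle_{\mathcal H} $$

are positive and measurable by (3) and

$$ \sum_{k} \int_X \langle T_x e_k, e_k \rangle_{\mathcal H} \, d\rho_X(x) = \sum_{k} \Big\langle \Big( \int_X T_x \, d\rho_X(x) \Big) e_k, e_k \Big\rangle_{\mathcal H} = \mathrm{Tr}(T) < +\infty. $$

Clearly $\sum_{k=1}^{n} \langle T_x e_k, e_k \rangle_{\mathcal H}$ converges to $\mathrm{Tr}\, T_x$, which is finite for
almost all $x \in X$ by the monotone convergence theorem.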


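Finally, we need the following probabilistic inequality due to [11].

Proposition 4 Let $(\Omega, \mathcal F, P)$ be a probability space and $\xi$ be a random variable
on $\Omega$ taking values in a real separable Hilbert space $\mathcal K$. Assume that there are two
positive constants $H$ and $\sigma$ such that

$$ \|\xi(\omega)\|_{\mathcal K} \le \frac{H}{2} \quad \text{a.s.} \tag{25} $$

$$ E[\|\xi\|_{\mathcal K}^2] \le \sigma^2. \tag{26} $$

Let $\ell \in \mathbb N$ and $0 < \eta < 1$, then

$$ P\Big[ (\omega_1,\dots,\omega_\ell) \in \Omega^\ell \ \Big|\ \Big\| \frac{1}{\ell} \sum_{i=1}^{\ell} \xi(\omega_i) - E[\xi] \Big\|_{\mathcal K} \le 2 \Big( \frac{H}{\ell} + \frac{\sigma}{\sqrt{\ell}} \Big) \log\frac{2}{\eta} \Big] \ge 1 - \eta. \tag{27} $$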


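PROOF. It is just a restatement of Th. 3.3.4 of [19] (see also [15]). Consider the
probability space $(\Omega^\ell, \mathcal F^\ell, P^\ell)$ and the set of independent random variables
with zero mean $\xi_i(\omega_1,\dots,\omega_\ell) = \xi(\omega_i) - E[\xi]$ defined on $\Omega^\ell$. The fact
that the $\xi_i$ are i.i.d. and conditions (25), (26) ensure that

$$ \|\xi_i\|_{\mathcal K} \le H \quad \text{a.s.} $$

$$ E[\|\xi_i\|_{\mathcal K}^2] \le \sigma^2, $$

so that, for all $m \ge 2$ it holds

$$ \sum_{i=1}^{\ell} E[\|\xi_i\|_{\mathcal K}^m] \le \frac{1}{2} m! \, B^2 H^{m-2} $$

with $B^2 = \ell \sigma^2$. So Th. 3.3.4 of [19] can be applied and it ensures

$$ P\Big[ \Big\| \frac{1}{\ell} \sum_{i=1}^{\ell} (\xi(\omega_i) - E[\xi]) \Big\|_{\mathcal K} \ge \frac{xB}{\ell} \Big] \le 2 \exp\Big( - \frac{x^2}{2(1 + x H B^{-1})} \Big) $$

for all $x \ge 0$. Letting $\delta = \frac{xB}{\ell}$ we get the equation

$$ \frac{1}{2} \Big( \frac{\ell \delta}{B} \Big)^2 \frac{1}{1 + \ell \delta H B^{-2}} = \frac{\ell \delta^2 \sigma^{-2}}{2(1 + \delta H \sigma^{-2})} = \log\frac{2}{\eta}, $$

since $B^2 = \ell \sigma^2$. Defining $t = \delta H \sigma^{-2}$,

$$ \frac{\ell \sigma^2}{2 H^2} \, \frac{t^2}{1 + t} = \log\frac{2}{\eta}. $$

The inverse of the function $\frac{t^2}{1+t}$ is the function $g(t) = \frac{1}{2}(t + \sqrt{t^2 + 4t})$, so

$$ \Big\| \frac{1}{\ell} \sum_{i=1}^{\ell} \xi(\omega_i) - E[\xi] \Big\|_{\mathcal K} \le \frac{\sigma^2}{H} \, g\Big( \frac{2 H^2}{\ell \sigma^2} \log\frac{2}{\eta} \Big) $$

with probability greater than $1 - \eta$. The claim follows by observing that
$g(t) \le t + \sqrt{t}$ and $2 \log\frac{2}{\eta} \ge \sqrt{2 \log\frac{2}{\eta}} \ge 1$.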
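
To get a feeling for the size of the deviation allowed by (27), the bound can be evaluated and compared with a simulation. The sketch below is only illustrative: it assumes a random variable uniform on the unit circle of $\mathbb R^2$, so that $H = 2$, $\sigma = 1$ and $E[\xi] = 0$, and the function names are hypothetical.

```python
import math
import random

def deviation_bound(H, sigma, ell, eta):
    """Right-hand side of (27): 2 (H/ell + sigma/sqrt(ell)) log(2/eta)."""
    return 2.0 * (H / ell + sigma / math.sqrt(ell)) * math.log(2.0 / eta)

def empirical_violation_rate(ell, eta, trials=500, seed=0):
    """Fraction of trials whose sample mean deviates more than (27) allows.

    xi is uniform on the unit circle of R^2, hence ||xi|| = 1 = H/2,
    E[xi] = 0 and E||xi||^2 = 1 = sigma^2 (an illustrative choice).
    """
    rng = random.Random(seed)
    bound = deviation_bound(H=2.0, sigma=1.0, ell=ell, eta=eta)
    violations = 0
    for _ in range(trials):
        sx = sy = 0.0
        for _ in range(ell):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            sx += math.cos(angle)
            sy += math.sin(angle)
        # Norm of the sample mean, to be compared with the bound.
        if math.hypot(sx, sy) / ell > bound:
            violations += 1
    return violations / trials

rate = empirical_violation_rate(ell=200, eta=0.1)
```

Proposition 4 guarantees that the violation probability is at most $\eta$; in this toy setting the observed rate is expected to be far smaller, since the inequality is not tight for such a well-behaved variable.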
4 Risk bound


The aim of this section is to give a probabilistic upper bound on the expected risk of
the solution given by the regularized least squares algorithm. The bound depends on
the number of examples $\ell$, the regularization parameter $\lambda$ and some a priori
information on the probability distribution $\rho$.

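In the following, we assume that the space $\mathcal H$ and the probability distribution $\rho$
satisfy Hypotheses 1 and 2, we fix a parameter $\lambda > 0$ and we define

(1) the residual

$$ \mathcal A(\lambda) = \|f^\lambda - f_{\mathcal H}\|_\rho^2 = \|\sqrt{T}(f^\lambda - f_{\mathcal H})\|_{\mathcal H}^2 $$

where $T$ is given by (15), $f^\lambda$ by (19) and $f_{\mathcal H}$ by (12);

(2) the reconstruction error

$$ \mathcal B(\lambda) = \|f^\lambda - f_{\mathcal H}\|_{\mathcal H}^2 $$

(3) the effective dimension

$$ \mathcal N(\lambda) = \mathrm{Tr}[(T + \lambda)^{-1} T], $$

where the trace is finite due to (11).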
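
In terms of the eigenvalues $t_k$ of $T$, the effective dimension is $\sum_k t_k / (t_k + \lambda)$. The short sketch below evaluates it for a hypothetical spectrum $t_k = k^{-2}$ truncated to 1000 terms (an assumption made only for illustration) and compares it with the elementary upper bound $\mathrm{Tr}(T)/\lambda$.

```python
def effective_dimension(eigenvalues, lam):
    """N(lam) = Tr[(T + lam)^{-1} T] for a positive operator T with these eigenvalues."""
    return sum(t / (t + lam) for t in eigenvalues)

# Hypothetical spectrum t_k = k^{-2}, truncated to 1000 eigenvalues.
spectrum = [1.0 / k ** 2 for k in range(1, 1001)]
trace = sum(spectrum)

for lam in (1.0, 0.1, 0.01, 0.001):
    n_lam = effective_dimension(spectrum, lam)
    # N(lam) grows as lam decreases but never exceeds Tr(T)/lam
    # nor the number of nonzero eigenvalues.
    print(f"lambda={lam:<6} N(lambda)={n_lam:8.3f}  Tr(T)/lambda={trace / lam:10.3f}")
```

The effective dimension counts, roughly, the eigenvalues of $T$ that are larger than $\lambda$, which is why it increases as the regularization parameter decreases.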


In the framework of learning, $\mathcal A(\lambda)$ is called the approximation error, whereas in
the framework of approximation theory $\sqrt{\mathcal B(\lambda)}$ is the approximation error. To
avoid confusion we follow the notation of inverse problems.


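We are now ready to state our main result on the upper bound.

Theorem 5 Let $\mathbf z \in Z^\ell$ be a training set drawn i.i.d. according to $\rho$ and
$f_{\mathbf z}^\lambda \in \mathcal H$ the corresponding estimator given by (24). With probability greater
than $1 - \eta$, $0 < \eta < 1$,

$$ I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le C_\eta \Big( \mathcal A(\lambda) + \frac{\kappa^2 \mathcal B(\lambda)}{\ell^2 \lambda} + \frac{\kappa \mathcal A(\lambda)}{\ell \lambda} + \frac{\kappa M}{\ell^2 \lambda} + \frac{M \mathcal N(\lambda)}{\ell} \Big) \tag{28} $$

provided that

$$ \ell \ge \frac{C_\eta \kappa}{\lambda} \max\Big( \mathcal N(\lambda), \sqrt{2 / C_\eta} \Big) \tag{29} $$

where $C_\eta = 128 \log^2(8/\eta)$.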

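PROOF. We split the proof into several steps. Let $\lambda$, $\eta$ and $\ell$ be as in the
statement of the theorem.

Step 1: Given a training set $\mathbf z = (\mathbf x, \mathbf y) \in Z^\ell$, (18) gives

$$ I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] = \|\sqrt{T}(f_{\mathbf z}^\lambda - f_{\mathcal H})\|_{\mathcal H}^2. $$

As usual,

$$ f_{\mathbf z}^\lambda - f_{\mathcal H} = (f_{\mathbf z}^\lambda - f^\lambda) + (f^\lambda - f_{\mathcal H}) $$

and (19), (24) give

$$ \begin{aligned}
f_{\mathbf z}^\lambda - f^\lambda &= (T_{\mathbf x} + \lambda)^{-1} A_{\mathbf z}^* \mathbf y - (T + \lambda)^{-1} A^* y \\
 &= (T_{\mathbf x} + \lambda)^{-1} \big\{ (A_{\mathbf z}^* \mathbf y - A^* y) + (T - T_{\mathbf x})(T + \lambda)^{-1} A^* y \big\} \\
 &= (T_{\mathbf x} + \lambda)^{-1} \big\{ (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H} + T_{\mathbf x} f_{\mathcal H} - T f_{\mathcal H}) + (T - T_{\mathbf x}) f^\lambda \big\} \quad \text{(by (17))} \\
 &= (T_{\mathbf x} + \lambda)^{-1} (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H}) + (T_{\mathbf x} + \lambda)^{-1} (T - T_{\mathbf x})(f^\lambda - f_{\mathcal H}).
\end{aligned} $$

The inequality $\|f_1 + f_2 + f_3\|_{\mathcal H}^2 \le 3(\|f_1\|_{\mathcal H}^2 + \|f_2\|_{\mathcal H}^2 + \|f_3\|_{\mathcal H}^2)$ implies

$$ I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le 3 \big( \mathcal A(\lambda) + \mathcal S_1(\lambda, \mathbf z) + \mathcal S_2(\lambda, \mathbf z) \big) \tag{30} $$

where

$$ \mathcal S_1(\lambda, \mathbf z) = \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H})\|_{\mathcal H}^2 $$

$$ \mathcal S_2(\lambda, \mathbf z) = \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\|_{\mathcal H}^2. $$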
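Step 2: probabilistic bound on $\mathcal S_2(\lambda, \mathbf z)$. Clearly

$$ \mathcal S_2(\lambda, \mathbf z) \le \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1}\|_{\mathcal L(\mathcal H)}^2 \, \|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\|_{\mathcal H}^2. \tag{31} $$

Step 2.1: probabilistic bound on $\|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1}\|_{\mathcal L(\mathcal H)}$. Assume that

$$ \Theta(\lambda, \mathbf z) = \|(T + \lambda)^{-1} (T - T_{\mathbf x})\|_{\mathcal L(\mathcal H)} \le \frac{1}{2}, \tag{32} $$

then, since $T_{\mathbf x} + \lambda = (I - (T - T_{\mathbf x})(T + \lambda)^{-1})(T + \lambda)$, the Neumann series gives

$$ \begin{aligned}
\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} &= \sqrt{T} (T + \lambda)^{-1} \big( I - (T - T_{\mathbf x})(T + \lambda)^{-1} \big)^{-1} \\
 &= \sqrt{T} (T + \lambda)^{-1} \sum_{n=0}^{+\infty} \big( (T - T_{\mathbf x})(T + \lambda)^{-1} \big)^n.
\end{aligned} $$

Since $T$ and $T_{\mathbf x}$ are self-adjoint, $\|(T - T_{\mathbf x})(T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} = \Theta(\lambda, \mathbf z)$,
so that

$$ \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \le \|\sqrt{T} (T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \sum_{n=0}^{+\infty} \Theta(\lambda, \mathbf z)^n \le \frac{1}{2\sqrt{\lambda}} \, \frac{1}{1 - \Theta(\lambda, \mathbf z)}, $$

where, by the spectral theorem, $\|\sqrt{T} (T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \le \frac{1}{2\sqrt{\lambda}}$.
Inequality (32) now gives

$$ \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \le \frac{1}{\sqrt{\lambda}}. \tag{33} $$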
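
The two elementary estimates behind (33) can be written out explicitly. The following is a sketch of the scalar computations, where $t \ge 0$ ranges over the spectrum of $T$ and $\Theta$ abbreviates $\Theta(\lambda, \mathbf z)$.

```latex
\begin{align*}
\|\sqrt{T}(T+\lambda)^{-1}\|_{\mathcal L(\mathcal H)}
   &\le \sup_{t \ge 0} \frac{\sqrt{t}}{t+\lambda} = \frac{1}{2\sqrt{\lambda}}
   && \text{since } t + \lambda \ge 2\sqrt{t\lambda}, \text{ with equality at } t = \lambda, \\
\sum_{n=0}^{+\infty} \Theta^n &= \frac{1}{1-\Theta} \le 2
   && \text{if } \Theta \le \tfrac{1}{2}, \\
\frac{1}{2\sqrt{\lambda}} \cdot 2 &= \frac{1}{\sqrt{\lambda}}.
\end{align*}
```

So the factor 2 gained from the Neumann series exactly cancels the factor $\frac{1}{2}$ coming from the spectral bound.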
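We claim that (29) implies (32) with probability greater than $1 - \eta$. Indeed, let
$\xi_1 : X \to \mathcal L_2(\mathcal H)$ be the random variable

$$ \xi_1(x) = (T + \lambda)^{-1} T_x. $$

Bound (9) and $\|(T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \le \frac{1}{\lambda}$ imply

$$ \|\xi_1\|_{\mathcal L_2(\mathcal H)} \le \|(T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \, \|T_x\|_{\mathcal L_2(\mathcal H)} \le \frac{\kappa}{\lambda} = \frac{H_1}{2}. $$

Lemma 3 ensures that $T_x$ is trace class for almost all $x$ and the inequality

$$ \mathrm{Tr}(AB) \le \|A\|_{\mathcal L(\mathcal H)} \, \mathrm{Tr}\, B \tag{34} $$

($A$ positive bounded operator, $B$ positive trace class operator) implies

$$ \begin{aligned}
E[\|\xi_1\|_{\mathcal L_2(\mathcal H)}^2] &= \int_X \mathrm{Tr}\big( T_x (T + \lambda)^{-2} T_x \big) \, d\rho_X(x) \\
 &\le \int_X \|T_x\|_{\mathcal L(\mathcal H)} \, \mathrm{Tr}\big( (T + \lambda)^{-2} T_x \big) \, d\rho_X(x) \\
 &\le \kappa \, \mathrm{Tr}\big( (T + \lambda)^{-2} T \big) \quad \text{(by (9))} \\
 &= \kappa \, \mathrm{Tr}\big( (T + \lambda)^{-1} \big( (T + \lambda)^{-1/2} T (T + \lambda)^{-1/2} \big) \big) \\
 &\le \kappa \, \|(T + \lambda)^{-1}\|_{\mathcal L(\mathcal H)} \, \mathrm{Tr}\big( (T + \lambda)^{-1} T \big) \\
 &\le \frac{\kappa}{\lambda} \mathcal N(\lambda) = \sigma_1^2.
\end{aligned} $$

Observing that

$$ E[\xi_1] = T (T + \lambda)^{-1}, \qquad \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_1(x_i) = (T + \lambda)^{-1} T_{\mathbf x}, $$

Proposition 4 applied to $\xi_1$ gives

$$ \|(T + \lambda)^{-1} T_{\mathbf x} - T (T + \lambda)^{-1}\|_{\mathcal L_2(\mathcal H)} \le 2 \log(6/\eta) \Big( \frac{2\kappa}{\lambda \ell} + \sqrt{\frac{\kappa \mathcal N(\lambda)}{\lambda \ell}} \Big) $$

with probability greater than $1 - \eta/3$. Then for all $\ell \in \mathbb N$ satisfying (29)

$$ \log(6/\eta) \Big( \frac{2\kappa}{\lambda \ell} + \sqrt{\frac{\kappa \mathcal N(\lambda)}{\lambda \ell}} \Big) \le \frac{1}{8} + \frac{1}{8} = \frac{1}{4} $$

so that

$$ \Theta(\lambda, \mathbf z) \le \|(T + \lambda)^{-1} T_{\mathbf x} - T (T + \lambda)^{-1}\|_{\mathcal L_2(\mathcal H)} \le \frac{1}{2} \tag{35} $$

with probability greater than $1 - \eta/3$.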

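Step 2.2: probabilistic bound on $\|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\|_{\mathcal H}$. Let
$\xi_2 : X \to \mathcal H$ be the random variable

$$ \xi_2(x) = T_x (f^\lambda - f_{\mathcal H}). $$

Bound (9) and the definition of $\mathcal B(\lambda)$ give

$$ \|\xi_2(x)\|_{\mathcal H} \le \|T_x\|_{\mathcal L(\mathcal H)} \, \|f^\lambda - f_{\mathcal H}\|_{\mathcal H} \le \kappa \sqrt{\mathcal B(\lambda)} = \frac{H_2}{2}. $$

Since $T_x$ is a positive operator

$$ \begin{aligned}
E[\|\xi_2\|_{\mathcal H}^2] &= \int_X \big\langle T_x T_x^{1/2} (f^\lambda - f_{\mathcal H}), T_x^{1/2} (f^\lambda - f_{\mathcal H}) \big\rangle_{\mathcal H} \, d\rho_X(x) \\
 &\le \int_X \|T_x\|_{\mathcal L(\mathcal H)} \, \big\langle T_x (f^\lambda - f_{\mathcal H}), f^\lambda - f_{\mathcal H} \big\rangle_{\mathcal H} \, d\rho_X(x) \\
 &\le \kappa \, \big\langle T (f^\lambda - f_{\mathcal H}), f^\lambda - f_{\mathcal H} \big\rangle_{\mathcal H} \\
 &= \kappa \, \|\sqrt{T} (f^\lambda - f_{\mathcal H})\|_{\mathcal H}^2 \\
 &= \kappa \mathcal A(\lambda) = \sigma_2^2.
\end{aligned} $$

Observing that

$$ E[\xi_2] = T (f^\lambda - f_{\mathcal H}), \qquad \frac{1}{\ell} \sum_{i=1}^{\ell} \xi_2(x_i) = T_{\mathbf x} (f^\lambda - f_{\mathcal H}), $$

Proposition 4 applied to $\xi_2$ gives

$$ \|(T - T_{\mathbf x})(f^\lambda - f_{\mathcal H})\|_{\mathcal H} \le 2 \log(6/\eta) \Big( \frac{2 \kappa \sqrt{\mathcal B(\lambda)}}{\ell} + \sqrt{\frac{\kappa \mathcal A(\lambda)}{\ell}} \Big) \tag{36} $$

with probability greater than $1 - \eta/3$. Replacing (33), (36) in (31), for all $\ell \in \mathbb N$
satisfying (29) it holds

$$ \mathcal S_2(\lambda, \mathbf z) \le 8 \log^2(6/\eta) \Big( \frac{4 \kappa^2 \mathcal B(\lambda)}{\ell^2 \lambda} + \frac{\kappa \mathcal A(\lambda)}{\ell \lambda} \Big) \tag{37} $$

with probability greater than $1 - 2\eta/3$.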

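Step 3: probabilistic bound on $\mathcal S_1(\lambda, \mathbf z)$. Clearly

$$ \mathcal S_1(\lambda, \mathbf z) \le \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (T + \lambda)^{1/2}\|_{\mathcal L(\mathcal H)}^2 \, \|(T + \lambda)^{-1/2} (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H})\|_{\mathcal H}^2. \tag{38} $$

Step 3.1: bound on $\|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (T + \lambda)^{1/2}\|_{\mathcal L(\mathcal H)}$. Clearly,

$$ \sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (T + \lambda)^{1/2} = \sqrt{T} (T + \lambda)^{-1/2} \big\{ I - (T + \lambda)^{-1/2} (T - T_{\mathbf x}) (T + \lambda)^{-1/2} \big\}^{-1}. $$

The spectral theorem ensures that $\|\sqrt{T} (T + \lambda)^{-1/2}\|_{\mathcal L(\mathcal H)} \le 1$ so, reasoning
as in Step 2.1,

$$ \|\sqrt{T} (T_{\mathbf x} + \lambda)^{-1} (T + \lambda)^{1/2}\|_{\mathcal L(\mathcal H)} \le 2 \tag{39} $$

provided that

$$ \|(T + \lambda)^{-1/2} (T - T_{\mathbf x}) (T + \lambda)^{-1/2}\|_{\mathcal L(\mathcal H)} \le \frac{1}{2}. \tag{40} $$

If $B = (T + \lambda)^{-1/2} (T - T_{\mathbf x}) (T + \lambda)^{-1/2}$, then

$$ \begin{aligned}
\|B\|_{\mathcal L_2(\mathcal H)}^2 &= \mathrm{Tr}\big( (T + \lambda)^{-1} (T - T_{\mathbf x}) (T + \lambda)^{-1} (T - T_{\mathbf x}) \big) \\
 &= \big\langle (T + \lambda)^{-1} (T - T_{\mathbf x}), \big( (T + \lambda)^{-1} (T - T_{\mathbf x}) \big)^* \big\rangle_{\mathcal L_2(\mathcal H)} \\
 &\le \|(T + \lambda)^{-1} (T - T_{\mathbf x})\|_{\mathcal L_2(\mathcal H)} \, \big\| \big( (T + \lambda)^{-1} (T - T_{\mathbf x}) \big)^* \big\|_{\mathcal L_2(\mathcal H)} \\
 &= \|(T + \lambda)^{-1} (T - T_{\mathbf x})\|_{\mathcal L_2(\mathcal H)}^2,
\end{aligned} $$

and, for all $\ell \in \mathbb N$ satisfying (29), (35) ensures that (40) holds with probability
$1 - 2\eta/3$.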

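Step 3.2: bound on $\|(T + \lambda)^{-1/2} (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H})\|_{\mathcal H}$. Let
$\xi_3 : X \times Y \to \mathcal H$ be the random variable

$$ \xi_3(x,y) = (T + \lambda)^{-1/2} K_x (y - f_{\mathcal H}(x)). $$

The definition of $M$ and the polar decomposition $K_x = \sqrt{T_x}\, U_x$, where $U_x$ is a
partial isometry, give

$$ \|\xi_3(x,y)\|_{\mathcal H} \le \|(T + \lambda)^{-1/2} K_x\|_{\mathcal L(Y,\mathcal H)} \, \|y - f_{\mathcal H}(x)\|_Y \le \sqrt{\frac{\kappa M}{\lambda}} = \frac{H_3}{2} $$

almost surely. Let $P_{x,y} = \langle \cdot, y - f_{\mathcal H}(x) \rangle_Y \, (y - f_{\mathcal H}(x))$ with
$\|P_{x,y}\|_{\mathcal L(Y)} = \|y - f_{\mathcal H}(x)\|_Y^2 \le M$, then

$$ \begin{aligned}
E[\|\xi_3\|_{\mathcal H}^2] &= \int_{X \times Y} \mathrm{Tr}\big( K_x^* (T + \lambda)^{-1} K_x P_{x,y} \big) \, d\rho(x,y) \\
 &\le \int_{X \times Y} \|P_{x,y}\|_{\mathcal L(Y)} \, \mathrm{Tr}\big( (T + \lambda)^{-1} T_x \big) \, d\rho(x,y) \\
 &\le M \, \mathrm{Tr}[(T + \lambda)^{-1} T] = M \mathcal N(\lambda) = \sigma_3^2,
\end{aligned} $$

where (34) is used replacing $\mathcal H$ with $Y$. Equation (17) gives

$$ E[\xi_3] = (T + \lambda)^{-1/2} (A^* y - T f_{\mathcal H}) = 0, $$

so Proposition 4 applied to $\xi_3$ ensures

$$ \|(T + \lambda)^{-1/2} (A_{\mathbf z}^* \mathbf y - T_{\mathbf x} f_{\mathcal H})\|_{\mathcal H} \le 2 \log(6/\eta) \Big( \frac{2}{\ell} \sqrt{\frac{\kappa M}{\lambda}} + \sqrt{\frac{M \mathcal N(\lambda)}{\ell}} \Big) \tag{41} $$

with probability greater than $1 - \eta/3$. Replacing (39), (41) in (38)

$$ \mathcal S_1(\lambda, \mathbf z) \le 32 \log^2(6/\eta) \Big( \frac{4 \kappa M}{\ell^2 \lambda} + \frac{M \mathcal N(\lambda)}{\ell} \Big) \tag{42} $$

with probability greater than $1 - \eta$.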
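Replacing bounds (37), (42) in (30),

$$ I[f_{\mathbf z}^\lambda] - I[f_{\mathcal H}] \le 3 \mathcal A(\lambda) + 8 \log^2(6/\eta) \Big( \frac{4 \kappa^2 \mathcal B(\lambda)}{\ell^2 \lambda} + \frac{\kappa \mathcal A(\lambda)}{\ell \lambda} + \frac{16 \kappa M}{\ell^2 \lambda} + \frac{4 M \mathcal N(\lambda)}{\ell} \Big) $$

and (28) follows by bounding the numerical constants with 128.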